Rich morphology based n-gram language models for Arabic
Abstract
In this paper we investigate the use of rich morphology, such as word segmentation, part-of-speech tagging and diacritic restoration, to improve Arabic language modeling. We enrich the context by performing morphological analysis on the word history. We use neural network models to integrate this additional information because they can handle long and enriched dependencies. We experimented with models that use progressively richer morphological features, starting with Arabic segmentation and later adding part-of-speech labels as well as words with restored diacritics. Experiments on Arabic broadcast news and broadcast conversation data showed significant improvements in perplexity, reducing the perplexities of the baseline N-gram model and the neural network N-gram model by 35% and 31%, respectively.
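As a rough illustration of the idea described in the abstract, the sketch below attaches progressively richer morphological features to each word of an n-gram history and shows how a relative perplexity reduction is computed. It is not the authors' implementation: the analysis table `TOY_ANALYSES`, the transliterated toy words, and the functions `enrich_history`, `perplexity` and `relative_reduction` are all hypothetical names introduced here, and a real system would obtain the analyses from a morphological analyzer and feed the enriched history to a neural network model.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical analyses: surface word -> (segments, POS tag, diacritized form).
# The entries are invented toy transliterations, not output of a real analyzer.
TOY_ANALYSES: Dict[str, Tuple[List[str], str, str]] = {
    "wktb": (["w+", "ktb"], "VERB", "wakataba"),
    "alwld": (["al+", "wld"], "NOUN", "alwaladu"),
}


def enrich_history(history: List[str], level: int) -> List[Tuple[str, ...]]:
    """Attach morphological features to each word of the history.

    level 0: word only; 1: + segments; 2: + POS tag; 3: + diacritized form.
    Unknown words fall back to the bare word and an "UNK" tag.
    """
    enriched = []
    for word in history:
        segments, pos, diacritized = TOY_ANALYSES.get(word, ([word], "UNK", word))
        features = [word]
        if level >= 1:
            features.append(" ".join(segments))
        if level >= 2:
            features.append(pos)
        if level >= 3:
            features.append(diacritized)
        enriched.append(tuple(features))
    return enriched


def perplexity(token_probs: List[float]) -> float:
    """Perplexity is the exponential of the mean negative log-probability."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional perplexity reduction relative to the baseline."""
    return (baseline - improved) / baseline
```

Under these assumptions, `enrich_history(["wktb", "alwld"], 3)` yields one feature tuple per history word, and a model whose perplexity falls from a baseline of 200 to 130 would show a relative reduction of 0.35. The numbers 200 and 130 are illustrative only and are not taken from the paper.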
Similar resources
On the use of morphological constraints in n-gram statistical language model
State-of-the-art speech recognition systems use statistical language modeling, and in particular N-gram models, to represent the language structure. The Arabic language has a rich morphology, which motivates the introduction of morphological constraints in the language model. Class-based N-gram models have shown satisfactory results, especially for language model adaptation and training from redu...
Discriminative n-gram language modeling for Turkish
In this paper, Discriminative Language Models (DLMs) are applied to the Turkish Broadcast News transcription task. Turkish presents a challenge to Automatic Speech Recognition (ASR) systems due to its rich morphology. Therefore, in addition to word n-gram features, morphology-based features like root n-grams and inflectional group n-grams are incorporated into DLMs in order to improve the langua...
Morphology-based language modeling for Arabic speech recognition
Language modeling is a difficult problem for languages with rich morphology. In this paper we investigate the use of morphology-based language models at different stages in a speech recognition system for conversational Arabic. Class-based and single-stream factored language models using morphological word representations are applied within an N-best list rescoring framework. In addition, we exp...
Compositional Morphology for Word Representations and Language Modelling
This paper presents a scalable method for integrating compositional morphological representations into a vector-based probabilistic language model. Our approach is evaluated in the context of log-bilinear language models, rendered suitably efficient for implementation inside a machine translation decoder by factoring the vocabulary. We perform both intrinsic and extrinsic evaluations, presentin...
Enriching Word Vectors with Subword Information
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new ap...